Service
Part 1: Overview


• • • 
Motivation


[Figure: from DATA to KNOWLEDGE. Labels: Data Discovery and Exploration; Search and Access; Visualization and Analysis]


• Data preparation steps are cumbersome and 
time consuming 

o Covers discovery, access and preprocessing 

• Limitations of current data and information
systems

o Searches on data are boolean searches on instrument
or geophysical keywords

o Underlying assumption that users have sufficient
knowledge of the domain vocabulary

o Lack of support for those unfamiliar with the domain
vocabulary or the breadth of relevant data available


Earth Science Metadata: 
Dark Resources 


• Dark resources - information resources that organizations 
collect, process, and store for regular business or operational 
activities but fail to utilize for other purposes 

o Challenge is to recognize, identify and effectively utilize these 
dark data stores 

• Metadata catalogs contain dark resources consisting of 
structured information, free form descriptions of data and 
browse images. 

o EOS Clearing House (ECHO) holds 3666 data collections, 127 
million records for individual files and 67 million browse images. 


Premise: Metadata catalogs can be utilized beyond their original 
design intent to provide new data discovery and exploration 
pathways to support science and education communities. 
Browse Image Example: Understanding
regional air pollution from haze


• A 2010 MODIS image over India that shows a
modest level of haze pollution is used to drive
the search

• Question: How often does haze occur over the
Indian subcontinent?
Results: Image Retrieval and Metadata

• Haze occurs more frequently in spring than
in summer

• Haze images were observed in the region for
over half a month in January



Goals 


• Design a Semantic Middleware Layer (SML) to 
exploit these metadata resources 

o provide novel data discovery and exploration capabilities 
that significantly reduce data preparation time. 

o utilize a varied set of semantic web, information retrieval 
and image mining technologies. 

• Design SML as a Service Oriented Architecture
(SOA) to allow individual components to be reused
and easily integrated into NASA's existing data and
information systems.
Specific Objectives


• Three specific semantic middleware core 
components 

o Image retrieval service - uses browse imagery 
to enable discovery of possible new case 
studies and granule metadata to present 
analytics results. 

o Data curation service - uses metadata and 
textual descriptions to find relevant data sets 
and granules needed to support the analysis 
of a phenomenon or an event.

o Semantic rules engine - automates data 
preprocessing and exploratory analysis and 
visualization tasks. 

• Demonstrate value using science use
cases

• Explore pathways to infuse this technology into
existing NASA information and data systems
Science Use Cases


• Dust storms, Volcanic Eruptions, Tropical Storms 

• Volcanic Eruptions: 

o Emit a variety of gases as well as volcanic ash, which are in 
turn affected by atmospheric conditions such as winds. 

o Role of Components 

• Image Retrieval Service is used to find volcanic ash 
events in browse imagery 

• Data Curation Service provides the relevant datasets to 
support event analysis 

• Rules Engine invokes a Giovanni processing workflow to
assemble and compare the wind, aerosol and SO2
data for the event


Part 2: Use Case Deconstruction

• • •

Volcanic Eruptions

Conceptual Flow and Data Dictionary
Phenomena: As commonly used in weather observing practice, an
observable occurrence of particular physical significance

1. Volcanic eruption
2. Hurricane

Event: Instance of a natural phenomenon

1. 2008 Chaiten volcanic eruption
2. Hurricane Katrina

Physical Manifestation: Feature characteristic, the estimation of
which is the purpose of an observation

1. Volcano: Ash plume
2. Hurricane: Wind fields; Eye (atmospheric pressure)

Instance (time and space) of physical manifestation

1. 2008 Chaiten ash plume
2. Wind speeds in and around Hurricane Katrina

Measurements (Observable Property): How an instrument observes
phenomena

1. Volcanic eruption: SO2 column; Aerosol optical depth
2. Hurricane: Rain rate; Wind speed/direction

Data Set Variable: Representation of the measurement in a data file;
variables within an actual data file

1. OMSO2e:ColumnAmountSO2_PBL
2. MOD08:Optical_Depth_Land_and_Ocean_Mean
3. Precipitation/Visible Frequencies, Pressure
[Figure: DataVariable instances: ex:atmospheri_concentration_SO2, ex:surface_temperature, ex:infrared_radiance, ex:visible_radiance, ex:aerosol_optical_dept_thickness]


Volcanic Eruption: Chaiten 2008 


The Chaiten Volcano seen from a commercial flight, October 2008. It
entered an eruptive phase for the first time in about 9,500 years on the
morning of May 2, 2008.

Eruption Time period: May 2 - Nov 2008 

Location: Andes region, Chile (-42.832778, -72.645833)










Browse Images 


Band 1-4-3 (true color) 


Band 7-2-1 


Example: MODIS-Aqua 2008-05-03 18:45 UTC 


http://lance-modis.eosdis.nasa.gov/cgi-bin/imagery/realtime.cgi?date=2008124


Example Relevant Data 

Total SO2 mass:

e.g., Chaiten is 10 kt (kilotons; 1 kt = 1000 metric tons)
ftp://measures.gsfc.nasa.gov/data/s4pa/SO2/MSVOLSO2L4.1/MSVOLSO2L4_v01-00-2014m1002.txt


Daily SO2:

OMI/Aura Sulphur Dioxide (SO2) Total Column Daily L2 Global 0.125 deg
http://disc.sci.gsfc.nasa.gov/datacollection/OMSO2G_V003.html


Calibrated Radiances:

MODIS/Aqua Calibrated Radiances 5-Min L1B Swath 1km
http://dx.doi.org/10.5067/modis/myd021km.006


Aerosol Optical Thickness:

MODIS/Aqua Aerosol 5-Min L2 Swath 10km
http://modis-atmos.gsfc.nasa.gov/MOD04_L2/

SeaWiFS Deep Blue Aerosol Optical Depth and Angstrom Exponent Level 2
Data 13.5km
http://disc.gsfc.nasa.gov/datacollection/SWDB_L2_V004.shtml


IR Brightness Temperature:

NCEP/CPC 4-km Global (60 deg N - 60 deg S) Merged IR Brightness
Temperature Dataset
Giovanni SO2 Plots



MODIS-Aqua 2008-05-03 18:45 UTC MODIS-Aqua 2008-05-05 18:30 UTC 
gdata2.sci.gsfc.nasa.gov/daac-bin/G3/gui.cgi?instance_id=omil2g
Giovanni Infrared Data Plot


MODIS-Aqua 2008-05-03 18:45 UTC MODIS-Aqua 2008-05-05 18:30 UTC 



[Figure: Global Merged IR (00min 18Z03MAY2008). Created by NASA Goddard GES DISC]

[Figure: Global Merged IR (00min 17Z05MAY2008). Created by NASA Goddard GES DISC]



http://disc.sci.gsfc.nasa.gov/daac-bin/hurricane_data_analysis_tool.pl 
Part 3: Data Curation Algorithm for Phenomena

• • •

Initial Results

Data Curation Algorithm Approaches


• Text mining 

o Pros: Don’t need to explicitly 
define the phenomena 

o Cons: Dependent on the
truth set; Catalog is
dynamic and new data
may never get classified

• Ontology Based 

o Pros: Best precision and 
recall 

o Cons: Labor intensive to 
build an explicit model 


• Information Retrieval 

o Boolean (Faceted) Search 

• Pros: Simple to implement 

• Cons: Phenomena can be 
complex; User may not 
know all the right keywords 

o Relevancy Ranking Algorithm 

• Pros: Lists the most relevant
data first

• Cons: Requires a custom 
algorithm 


Assumptions/Observations 

• Catalog metadata (ECHO) is rich and all metadata 
records have been tagged with appropriate 
vocabulary terms (GCMD) 

• A phenomenon can be defined as a bag of keywords
drawn from the vocabulary terms

o Information need can be captured by using a broad query 

• Keywords (tags) in the metadata and the unstructured 
text (description) can be used 

• A keyword is used only once per metadata record

o Term frequency does not matter 

• Document frequency for keywords can be used 

o Some keywords may occur in many metadata records 
Experiment Setup and Approach


• Randomly select 200
sample dataset
metadata records from ECHO

• Label 200 datasets 

o binary: relevant to 
phenomena/not relevant 
to phenomena 
(Hurricane) 

• Compile set of
keywords (GCMD)
relevant to Hurricane -
“bag of words” model


• Filter 

o Spatial filter 

o Temporal resolution 

• “<= daily” 

o 85 datasets filtered out 

• Apply algorithms on
remaining 115 datasets

o Jaccard coefficient-based
ranking

o Vector Space Model
using cosine similarity-based
ranking
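
The filtering step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Dataset` record, its field names, the bounding-box overlap test, and the reading of "<= daily" as "24 hours or finer" are hypothetical and are not taken from ECHO or from the slides.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical, simplified view of a dataset metadata record.
    short_name: str
    bbox: tuple  # (west, south, east, north) in degrees
    temporal_resolution_hours: float


def overlaps(a, b):
    """Return True if two (west, south, east, north) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prefilter(datasets, region, max_hours=24.0):
    """Keep datasets that cover the region and are daily or finer.

    Assumption: "<= daily" means a temporal resolution of at most 24 hours.
    """
    return [
        d
        for d in datasets
        if overlaps(d.bbox, region) and d.temporal_resolution_hours <= max_hours
    ]
```

Only the datasets that pass both filters would then be handed to the ranking algorithms.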



Algorithms 

Jaccard Coefficient 

$J(A, B) = |A \cap B| / |A \cup B|$


Where: 

• A - keywords defining a
phenomenon

• B - keywords in a given 
dataset 
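
As a minimal sketch of Jaccard-based ranking, assuming each dataset is represented simply as a set of keyword strings (the function names and the dictionary layout are illustrative, not part of the original system):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no keywords to compare; score them as 0.
    return len(a & b) / len(union) if union else 0.0


def rank_by_jaccard(phenomenon_keywords, datasets):
    """Rank datasets (name -> keyword set) by similarity to the phenomenon."""
    scored = [
        (name, jaccard(phenomenon_keywords, keywords))
        for name, keywords in datasets.items()
    ]
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, a phenomenon defined by four keywords and a dataset tagged with two of them (and nothing else) would score 2 / 4 = 0.5.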
Vector Space Model

• Determine term frequency (tf): 1 in our case

• Determine inverse document frequency (idf): derived
from the number of metadata records that contain the
keyword

• Calculate cosine similarity

o Sum (tf x idf) for each keyword
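
A rough sketch of the vector space ranking, with tf fixed at 1 as stated above. The slides do not give the exact idf formula, so the common choice idf = log(N / df) is an assumption here, as are the function names and the name-to-keyword-set layout.

```python
import math
from collections import Counter


def idf_weights(records):
    """Compute idf = log(N / df) per keyword over name -> keyword-set records.

    Assumption: this standard idf form stands in for the unspecified original.
    """
    n = len(records)
    df = Counter(k for keywords in records.values() for k in set(keywords))
    return {k: math.log(n / count) for k, count in df.items()}


def cosine_score(query, keywords, idf):
    """Cosine similarity between tf-idf vectors, with tf = 1 for every keyword."""
    q = {k: idf.get(k, 0.0) for k in set(query)}
    d = {k: idf.get(k, 0.0) for k in set(keywords)}
    dot = sum(weight * d[k] for k, weight in q.items() if k in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```

Under this weighting, a keyword that appears in every metadata record gets idf = 0 and so contributes nothing to the score, which matches the observation that some keywords occur in many records.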


Retrieval Results

[Figure: precision versus recall]

90% precision at 70% recall:

70% of the relevant data are retrieved with 90%
precision
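
For reference, precision and recall at a given cutoff of a ranked list can be computed along these lines. This is a generic sketch; the function name and inputs are illustrative, and it does not reproduce the reported numbers.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the top-k ranked names against a labeled truth set."""
    top = ranked[:k]
    hits = sum(1 for name in top if name in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping `k` from 1 to the length of the ranked list yields the points of a precision-recall curve.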




Questions 